Week 7: Fine-tuning

Applied Generative AI for AI Developers

Amit Arora

Overview

  • Why fine-tuning matters
  • Fine-tuning in the model customization spectrum
  • Parameter-Efficient Fine-Tuning (PEFT) methods
  • Types of LoRA techniques
  • Challenges: Catastrophic forgetting and accuracy concerns
  • Evaluation methodologies
  • Ethical considerations
  • Future directions

Why Fine-Tune Large Language Models?

  • Specialized domain adaptation - medical, legal, scientific
  • Task-specific optimization - improving performance on specific tasks
  • Alignment with values/preferences - customizing for safety, tone, style
  • Proprietary knowledge integration - incorporating private data/knowledge
  • Reducing hallucinations for specific domains
  • Efficiency - smaller fine-tuned models vs. large general models

The Model Customization Spectrum

flowchart LR
    A[Prompt Engineering] -->|Increasing complexity/cost| B[Retrieval Augmented Generation]
    B --> C[Fine-Tuning]
    C --> D[Continued Pre-training]
    E[Pre-training from Scratch]
    D --> E
    
    style A fill:#d4f1f9
    style B fill:#c2e5d3
    style C fill:#ffe0b2
    style D fill:#ffccbc
    style E fill:#ffaba0

Cost-Benefit Analysis

| Approach | Cost | Time | Data Required | Performance Improvement |
|----------|------|------|---------------|-------------------------|
| Prompt Engineering | $ | Hours-Days | Low | Low-Medium |
| RAG | $$ | Days | Medium | Medium |
| Fine-Tuning | $$ | Days-Weeks | Medium-High | Medium-High |
| Continued Pre-training | $$$ | Weeks-Months | High | High |
| Pre-training | $$$$ | Months | Massive | Complete control |

Traditional Fine-Tuning vs. PEFT

Full Fine-Tuning

  • Updates all model parameters
  • Requires large amounts of GPU memory
  • More expensive computationally
  • More prone to catastrophic forgetting
  • Requires more data to avoid overfitting

Parameter-Efficient Fine-Tuning

  • Updates small subset of parameters
  • Much lower memory requirements
  • Computationally efficient
  • Better preserves general capabilities
  • Works well with limited data

Parameter-Efficient Fine-Tuning (PEFT)

  • Adapter methods: Add small trainable modules to frozen model
  • Prompt tuning: Learn continuous prompts (soft prompts)
  • Prefix tuning: Add trainable prefix to each layer
  • LoRA (Low-Rank Adaptation): Inject trainable low-rank matrices
  • QLoRA: Quantized version of LoRA

LoRA: Low-Rank Adaptation

LoRA approach to fine-tuning

Types of LoRA Approaches

  • Standard LoRA: Low-rank adaptations to attention weights
  • QLoRA: Quantized LoRA for reduced memory footprint
  • AdaLoRA: Adaptive budget allocation across weight matrices
  • DyLoRA: Dynamic rank allocation during training
  • DoRA: Weight-Decomposed Low-Rank Adaptation - splits weights into magnitude and direction, applying LoRA to the direction component
  • GLoRA: Generalized LoRA - unified formulation with additional tunable scaling and shifting paths

Understanding LoRA Hyperparameters

  • Rank (r): Size of low-rank matrices (typically 8-64)
    • Higher rank = more capacity but more parameters
  • Alpha: Scaling factor for LoRA updates
  • Dropout: Regularization for LoRA weights
  • Target modules: Which layers to apply LoRA to
    • Attention matrices (query, key, value)
    • Feed forward networks
    • Both
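How rank and alpha interact can be seen in the LoRA update rule itself, y = Wx + (alpha/r) · B(Ax). A tiny pure-Python sketch with toy matrices (sizes and values are illustrative only):

```python
# Minimal LoRA forward pass on toy matrices (pure Python, no frameworks).
# The frozen weight W stays untouched; only A and B would receive gradients.

def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """y = Wx + (alpha / r) * B(Ax) -- the standard LoRA update rule."""
    base = matvec(W, x)                  # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))   # trainable low-rank path
    scale = alpha / r                    # alpha rescales the update
    return [b + scale * lr for b, lr in zip(base, low_rank)]

# 4x4 frozen weight (identity for clarity), rank r=2 adapters
W = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
A = [[0.1] * 4 for _ in range(2)]        # r x d_in
B = [[0.5] * 2 for _ in range(4)]        # d_out x r
x = [1.0, 2.0, 3.0, 4.0]

y = lora_forward(W, A, B, x, alpha=16, r=2)
print(y)
```

Note that doubling alpha doubles the adapter's contribution without changing the number of trainable parameters, which is why alpha and r are tuned together.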

QLoRA: Quantization + LoRA

  • Quantizes base model to 4 or 8 bits
  • Keeps LoRA adapters in 16-bit precision
  • Uses:
    • 4-bit NormalFloat (NF4)
    • Double quantization
    • Paged optimizers
  • Dramatically reduces memory requirements
  • Enables fine-tuning on consumer hardware
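The memory saving comes from storing the frozen weights in 4-bit codes plus a per-block scale. The sketch below uses simple absmax int4 quantization for illustration; the NF4 scheme QLoRA actually uses instead places its levels at quantiles of a normal distribution:

```python
# Illustrative 4-bit absmax quantization (NOT the NF4 scheme QLoRA uses).
# Shows why quantizing frozen base weights saves memory while LoRA
# adapters stay in 16-bit precision. Example weights are invented.

def quantize_absmax_int4(weights):
    """Map floats to signed 4-bit integers in [-7, 7] plus one float scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.31, -0.72, 0.05, 1.40, -1.12]
q, s = quantize_absmax_int4(w)
w_hat = dequantize(q, s)

print("quantized codes:", q)   # each code fits in 4 bits instead of 32
print("reconstruction:", [round(v, 3) for v in w_hat])
```

Each weight shrinks from 32 (or 16) bits to 4, at the cost of a bounded rounding error; "double quantization" in QLoRA further compresses the per-block scale factors themselves.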

QLoRA memory efficiency

Fine-Tuning Methods: Comparison

| Method | Memory Usage | Training Speed | Performance | Portability |
|--------|--------------|----------------|-------------|-------------|
| Full Fine-Tuning | Highest | Slowest | Best (potentially) | Low |
| LoRA | Low | Fast | Good | High |
| QLoRA | Lowest | Medium | Good | High |
| Adapter | Low | Medium | Good | Medium |
| Prompt Tuning | Very Low | Fastest | Lower | Limited |

Challenges: Catastrophic Forgetting

  • Model loses general capabilities when learning new tasks
  • Especially problematic with limited or biased data
  • Mitigation strategies:
    • PEFT methods (like LoRA) limit impact
    • Regularization techniques
    • Rehearsal/replay of examples from original distribution
    • Knowledge distillation with teacher models
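Rehearsal/replay can be as simple as mixing a fraction of general-distribution examples back into the fine-tuning stream. A minimal sketch; the replay ratio and example records are illustrative choices:

```python
import random

# Sketch of rehearsal/replay: blend examples from the original (general)
# distribution into the fine-tuning data so the model keeps seeing them.
# The 20% ratio and record names below are illustrative assumptions.

def replay_mix(task_data, general_data, replay_ratio=0.2, seed=0):
    """Return a shuffled stream where ~replay_ratio of items are replays."""
    rng = random.Random(seed)
    n_replay = round(len(task_data) * replay_ratio / (1 - replay_ratio))
    mixed = list(task_data) + rng.sample(general_data, n_replay)
    rng.shuffle(mixed)
    return mixed

task = [f"medical-example-{i}" for i in range(8)]
general = [f"general-example-{i}" for i in range(100)]

batch = replay_mix(task, general, replay_ratio=0.2)
print(len(batch), "examples,",
      sum(e.startswith("general") for e in batch), "replayed")
```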

Loss in Accuracy Concerns

  • Distribution Shift: Fine-tuning data may represent narrow use cases
  • Overfitting: Model learns training data peculiarities
  • Generalization Loss: Performance degradation on out-of-domain tasks
  • Evaluation Gap: Training metrics may not reflect real-world performance
  • Prompt Sensitivity: Fine-tuned models may become more brittle to prompt formulations

Data Preparation for Fine-Tuning

  • Quality over quantity: A small set of curated, high-quality examples beats a larger but noisier dataset
  • Balanced representation: Cover edge cases and diverse scenarios
  • Format consistency: Standardized input/output patterns
  • Instruction-following format: Clear instruction → response pairs
  • Data augmentation: Techniques to artificially expand dataset
  • Synthetic data generation: Using existing LLMs to create training data

Instruction Tuning

INSTRUCTION: Write a poem about artificial intelligence in the style of Shakespeare.

RESPONSE: Hark! What light through silicon valley breaks?
It is the east, and Artificial Intelligence is the sun.
Arise, fair algorithms, and kill the envious human,
Who is already sick and pale with grief,
That thou, AI, art far more advanced than they.
...
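Instruction/response pairs like the one above are usually rendered through a fixed template into the single text field SFT pipelines train on. A minimal sketch; the `### Instruction:` delimiters are one common Alpaca-style convention, not a requirement, so match whatever template your base model expects:

```python
# Render instruction/response pairs into a single training string.
# The delimiter style is an assumption (Alpaca-style), for illustration.

TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_example(instruction: str, response: str) -> str:
    return TEMPLATE.format(instruction=instruction.strip(),
                           response=response.strip())

record = format_example(
    "Write a poem about artificial intelligence in the style of Shakespeare.",
    "Hark! What light through silicon valley breaks?",
)
print(record)
```

Keeping this template identical at training and inference time matters: fine-tuned models can be brittle to prompt-format mismatches.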

Fine-Tuning Architectures

Encoder-Decoder Models

  • BART, T5, Flan-T5
  • Well-suited for:
    • Summarization
    • Translation
    • Question answering
    • Structured generation

Decoder-Only Models

  • GPT family, LLaMA, Mistral
  • Well-suited for:
    • Open-ended generation
    • Dialogue
    • Creative writing
    • Code generation

Tools and Frameworks for Fine-Tuning

  • Hugging Face Transformers & PEFT: Most common academic/research approach
  • TRL (Transformer Reinforcement Learning): SFT & RLHF
  • Unsloth: Efficient LoRA fine-tuning
  • LangChain: Integration with other components
  • Cloud Providers:
    • Amazon SageMaker
    • Azure OpenAI Service
    • Google Vertex AI

Evaluation Methodologies

Automatic Metrics

  • Perplexity
  • ROUGE, BLEU, METEOR
  • LLM-as-a-judge
  • Task-specific metrics
  • Benchmark suites (GLUE, SuperGLUE)

Human Evaluation

  • Direct assessment
  • A/B testing
  • Ranking
  • Error analysis
  • User studies
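Of the automatic metrics, perplexity is the simplest to compute: the exponential of the average negative log-likelihood per token. A toy sketch with invented per-token probabilities (in practice these come from the model's softmax over its vocabulary):

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# Lower is better; the probabilities below are invented for illustration.

def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = [0.9, 0.8, 0.95, 0.85]   # model assigns high prob to each token
uncertain = [0.2, 0.1, 0.3, 0.15]

print(f"confident model: {perplexity(confident):.2f}")
print(f"uncertain model: {perplexity(uncertain):.2f}")
```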

Advanced Fine-Tuning Paradigms

  • RLHF: Reinforcement Learning from Human Feedback
    • SFT → Reward Modeling → RL optimization
  • DPO: Direct Preference Optimization
    • Simplifies RLHF by eliminating reward model
  • ORPO: Odds Ratio Preference Optimization
    • Folds preference optimization into SFT via an odds-ratio penalty, with no reference model
  • GRPO: Group Relative Policy Optimization
    • Samples a group of responses per prompt and computes advantages relative to the group
    • Removes the separate value/critic model used in PPO-style RLHF
  • Constitutional AI: Constraining outputs to follow principles
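DPO's simplification is visible in its loss, which needs only log-probabilities from the policy and a frozen reference model, with no learned reward: L = -log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). A pure-Python sketch with invented sequence log-probs:

```python
import math

# DPO loss on a single preference pair (chosen y_w, rejected y_l),
# computed from sequence log-probabilities. The log-prob values and
# beta below are invented for illustration.

def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))"""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen answer more than the reference does:
loss_good = dpo_loss(policy_chosen=-4.0, policy_rejected=-9.0,
                     ref_chosen=-5.0, ref_rejected=-6.0)
# Policy prefers the rejected answer -> higher loss:
loss_bad = dpo_loss(policy_chosen=-8.0, policy_rejected=-3.0,
                    ref_chosen=-5.0, ref_rejected=-6.0)

print(f"{loss_good:.4f} < {loss_bad:.4f}")
```

The gradient of this loss pushes the policy to widen the chosen-vs-rejected margin relative to the reference, which is exactly what the RLHF reward-model-plus-PPO pipeline achieves with far more machinery.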

Multi-Task Fine-Tuning

  • Train on multiple tasks simultaneously
  • Benefits:
    • Better generalization
    • Positive transfer between tasks
    • More efficient use of model capacity
  • Challenges:
    • Task balancing
    • Negative interference
    • More complex evaluation

Ethical Considerations

  • Data provenance: Ensuring training data is ethically sourced
  • Bias amplification: Fine-tuning can reinforce biases in data
  • Dual-use concerns: Fine-tuned models for harmful purposes
  • Attribution and transparency: Documenting fine-tuning process
  • Over-reliance: Potential over-trust in fine-tuned systems
  • Accessibility: Democratizing fine-tuning capabilities

Practical Tips for Effective Fine-Tuning

  1. Start with a strong, well-aligned base model
  2. Use high-quality, diverse, and representative data
  3. Align training format with intended use patterns
  4. Start with smaller models before scaling up
  5. Carefully monitor for overfitting and forgetting
  6. Conduct robust evaluation on varied test cases
  7. Consider ensemble or mixture-of-experts approaches
  8. Document your process and data for reproducibility

Resources

Thank You!

Any questions?